A GPU-accelerated Branch-and-Bound Algorithm
for the Flow-Shop Scheduling Problem 






N. Melab*, I. Chakroun*, M. Mezmaz** and D. Tuyttens**
* Université Lille 1, LIFL/UMR CNRS 8022
59655 - Villeneuve d'Ascq cedex - France
Email: {nouredine.melab, imen.chakroun}@lifl.fr
** Mathematics and OR Department, University of Mons, Belgium
Email: {mohand.mezmaz, daniel.tuyttens}@umons.ac.be



Abstract — Branch-and-Bound (B&B) algorithms are time- 
intensive tree-based exploration methods for solving to optimality 
combinatorial optimization problems. In this paper, we inves- 
tigate the use of GPU computing as a major complementary 
way to speed up those methods. The focus is put on the 
bounding mechanism of B&B algorithms, which is the most 
time consuming part of their exploration process. We propose 
a parallel B&B algorithm based on a GPU-accelerated bounding 
model. The proposed approach concentrates on optimizing data
access management to further improve the performance of the 
bounding mechanism which uses large and intermediate data 
sets that do not completely fit in GPU memory. Extensive 
experiments of the contribution have been carried out on well- 
known FSP benchmarks using an Nvidia Tesla C2050 GPU card. 
We compared the obtained performance to single-threaded and
multi-threaded CPU-based executions. Accelerations of up to ×100
are achieved for large problem instances.

Index Terms — Massively Parallel Computing, GPU Computing, 
Branch-and-Bound Algorithms, Lower Bounding, Flow-Shop 
Scheduling. 

I. Introduction 

Combinatorial optimization problems¹ are NP-hard and
CPU-time intensive in practice. Branch-and-Bound (B&B) 
algorithms are efficient methods for solving to optimality those 
problems. Their execution consists in exploring a search space 
by dynamically building a tree whose root node is the original 
problem, the intermediate nodes are sub-problems, and the 
leaves are potential solution(s). B&B proceeds in several itera- 
tions during which the best solution found so far (upper bound) 
is progressively improved. During the exploration, a bounding 
mechanism, based on a lower bound function, is used to 
eliminate all the sub-problems (i.e. cut their corresponding 
sub-trees) that are not likely to lead to optimal solutions. Such
a powerful mechanism allows one to significantly reduce the size
of the explored search space and thus its exploration time cost.

However, even if such a mechanism is highly efficient, it is not
sufficient to deal with large problem instances. Over the last
decades, parallel computing has emerged as an attractive way to
deal with larger instances. The design and implementation of
parallel B&B is strongly influenced by the computing platform.
Many contributions have been proposed for the design and
implementation of parallel B&B methods using Massively Parallel
Processors (MPP) [8], Networks or Clusters of Workstations (NOWs
or COWs) [?] and Shared Memory or SMP machines [9]. In this paper,
we investigate the design of B&B algorithms on Graphics Processing
Units (GPU). In combinatorial optimization, GPU computing has been
successfully used for meta-heuristics (near-optimal methods) [14]
but not yet for exact B&B methods.

¹ An optimization problem consists in minimizing or maximizing a
cost function. Without loss of generality, in this paper the
minimization case is considered.

Most existing parallel B&B algorithms are based on the parallel
exploration of the search tree. Such a parallel model is not
suited to GPUs because the explored search tree is highly
irregular. The best parallel model for B&B on GPU is the parallel
evaluation of the lower bound function (thread kernel) on pools of
sub-problems (parallel bounding). Such a model must be rethought
at the design level as well as at the implementation level, taking
into account both the characteristics of GPU accelerators and
those of the lower bound computation function. On the one hand, a
GPU is a many-core co-processor device that provides a hierarchy
of memories having different sizes and access latencies, making
data placement and sharing challenging. On the other hand, the
lower bound computation function is generally problem-dependent.
In this paper, the focus is on the permutation Flow-Shop
Scheduling Problem (FSP) (see Section II-B). The lower bound
function used in this work for FSP is that proposed in [?] for two
machines and generalized in [?] to more than two machines. The
implementation of such a function makes use of six data structures
of different sizes and access frequencies, making data placement
on GPU challenging.

Preliminary experiments we have carried out on some of Taillard's
FSP instances [6] have shown that the time spent by B&B evaluating
the lower bounds of the examined sub-problems is on average around
98.5% of its total execution time. Such a result illustrates the
potential benefit of parallelizing the bounding operation. The
major contribution of this paper consists in revisiting B&B to
allow efficient solving of large FSP instances on GPU. Keeping in
mind the characteristics of both the lower bound function and the
GPU device mentioned above, the challenge is to define a new
approach for the optimal mapping of the data structures of the
lower bound function onto the hierarchy of memories provided by
the GPU device. This requires a careful analysis of both the data
structures (size and access frequency) and the GPU memories (size
and access latency).

The remainder of the paper is organized as follows: Section II
presents the B&B algorithm applied to the permutation FSP, the
associated lower bound used in this paper, and its implementation
and complexity analysis. In Section III, we describe our proposed
GPU-based approach for B&B applied to FSP. Details are given on
the parallel approach and memory access optimization. In Section
IV, we report experimental results demonstrating the efficiency of
our approach. In Section V, we compare the performance of the
proposed approach to a multi-threaded CPU version of the B&B.
Finally, some conclusions and perspectives of this work are drawn
in Section VI.

II. B&B and Lower Bound for the Permutation FSP 

A. Parallel B&B algorithms 

Branch-and-Bound (B&B) algorithms are based on an implicit
enumeration of the solutions composing the search space associated
with the problem to be tackled. The search space is explored by
dynamically building a tree whose root node designates the
original problem. The construction of the B&B tree and its
exploration are performed using four operators: branching,
bounding, selection and elimination. The algorithm proceeds in
several iterations during which the best solution found so far
(upper bound) is progressively improved. The generated and not yet
examined sub-problems are kept in a list initialized to the
original problem. At each iteration, a sub-problem is selected
from this list, according to some strategy (depth-first,
best-first, ...), using the selection operator. The branching
operator performs its decomposition into other sub-problems. The
bounding operator calculates a lower bound of each generated
sub-problem. Each sub-problem having a lower bound greater than
the upper bound is eliminated using the elimination operator,
which means that it will not be decomposed.

Existing parallel B&B algorithms are based on three parallel
models proposed in [1]: parallel application of the operators on
the generated sub-problems (Type 1), parallel building and
exploration of a B&B tree (Type 2), and parallel (cooperative or
independent) building and exploration of several B&B trees
(Type 3). We have later revisited these parallel approaches for
large-scale computational grids [13] using the Type 2 parallel
model. Grid computing provides an impressive computing power to
solve challenging instances in combinatorial optimization [11].
However, computational grids providing a huge amount of resources
are not easily available and accessible to every user. Recently,
Graphics Processing Units (GPU accelerators) have emerged as a new
popular support for massively parallel computing. Such resources
supply great computing power, are energy-efficient and, unlike
grids, are widely available: laptops, desktops, clusters, etc. In
the following, we revisit the Type 1 parallel model on GPU for
solving Flow-Shop problems.

Fig. 1. The search tree generated and explored by a B&B algorithm
for solving an FSP with 3 jobs. Nodes with a lower bound (LB)
greater than the upper bound (UB) are pruned; nodes with a lower
bound lower than or equal to the UB are decomposed (branched).

B. B&B for the permutation FSP 

The general FSP [3] consists in scheduling a pool of $n$ jobs on a
set of $m$ machines such that each of the jobs $J_1, J_2, \ldots,
J_n$ has to be processed on the machines $M_1, M_2, \ldots, M_m$
in that order. Job $J_i$ ($i = 1, 2, \ldots, n$) consists
therefore of a sequence of $m$ operations $O_{i1}, O_{i2}, \ldots,
O_{im}$; $O_{ik}$ being the processing of $J_i$ on $M_k$ during an
uninterrupted processing time $p_{ik}$. $M_k$ ($k = 1, 2, \ldots,
m$) can handle at most one job at a time. The objective is to find
a processing order on each $M_k$ such that the time (makespan)
required to complete all jobs is minimized. If the problem is restricted
to the minimization over all permutation schedules, meaning 
with the same processing order on each machine, the resulting 
problem is called the permutation Flow-Shop problem, which 
is the focus of this work. In the remainder of this paper, FSP 
designates a permutation FSP. 

For $m = 2$, an optimal schedule can be found in $O(n \log n)$
steps using Johnson's algorithm [?]. For $m \geq 3$, the problem
has been shown to be NP-hard [4]. Due to such complexity, the
enumerative solution approach provided by B&B algorithms is
well-suited to solving the problem to optimality. As illustrated
(for $n = 3$) in Figure 1, the B&B enumeration scheme is based on
a search tree whose root node contains the original problem (empty
schedule).
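The two-machine case mentioned above can be sketched in C as follows. This is a minimal illustration of the classical Johnson's rule (without lags), not the authors' code; the `Job` type and the function names are hypothetical. Jobs whose first-machine time does not exceed their second-machine time go first, by increasing first-machine time; the others follow, by decreasing second-machine time. The sort gives the $O(n \log n)$ cost.

```c
#include <stdlib.h>

/* One job of a two-machine flow shop: processing times on M1 and M2.
   (Hypothetical type, introduced only for this illustration.) */
typedef struct { int p1; int p2; } Job;

/* Johnson's rule as a qsort comparator: jobs with p1 <= p2 come first,
   by increasing p1; the remaining jobs follow, by decreasing p2. */
static int johnson_cmp(const void *a, const void *b) {
    const Job *x = a, *y = b;
    int x_first = x->p1 <= x->p2;
    int y_first = y->p1 <= y->p2;
    if (x_first != y_first) return y_first - x_first;
    if (x_first) return (x->p1 > y->p1) - (x->p1 < y->p1);
    return (y->p2 > x->p2) - (y->p2 < x->p2);
}

/* Sorts the jobs in place into a Johnson order and returns the makespan
   of the resulting permutation schedule. */
int johnson_makespan(Job *jobs, size_t n) {
    int t1 = 0, t2 = 0;   /* completion times on M1 and M2 */
    qsort(jobs, n, sizeof *jobs, johnson_cmp);
    for (size_t i = 0; i < n; i++) {
        t1 += jobs[i].p1;
        t2 = (t2 > t1 ? t2 : t1) + jobs[i].p2;
    }
    return t2;
}
```

For example, calling `johnson_makespan` on the five jobs (3,6), (5,2), (1,2), (6,6), (7,5) reorders them as (1,2), (3,6), (6,6), (7,5), (5,2).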

The decomposition of this problem generates $n$ sons, each of
which designates a sub-problem. The son number $i$ represents the
sub-problem in which job $J_i$ is scheduled first on all machines.
The recursive application of the decomposition operator to the
generated sub-problems makes it possible to develop the search
tree. The number of potential schedules (permutations) is $n!$,
which is prohibitively large for big problem instances such as
Taillard's 200 × 20 ones (200! schedules) [6]. There are two major
ways to speed up the exploration of large search trees. The first
way consists in using an efficient bounding operator. Applied to a
sub-problem, such an operator associates a value to its
corresponding tree node using a lower bound function. As
illustrated in Figure 1, the sub-problem is not decomposed (and
its tree node is pruned) if its lower bound value is greater than
the cost of the best schedule found so far (called the upper
bound) during the exploration of the search tree. The second way
is to use massively parallel computing based on the three
parallelism types presented in Section II-A. We recall that the
focus of this paper is only on Type 1, i.e. the parallel
evaluation of the lower bound on a pool of sub-problems.

C. Lower Bound for FSP 

As stated above, the objective (cost function) of the FSP
considered in this paper is the makespan $C_{\max}$, which
represents the completion time of the last scheduled job on the
last machine. A sub-problem (partial schedule) is given by
$\pi = (\pi(1), \pi(2), \ldots, \pi(l))$, indicating that
$J_{\pi(i)}$ occupies the $i$-th position on each machine for
$i = 1, \ldots, l$. Solving the sub-problem consists in finding
the optimal schedule of the $n - l$ remaining unscheduled jobs.
Before solving such a sub-problem, it is checked whether or not
the optimal solution of the original problem could be obtained
from this sub-problem. In other words, it is checked whether the
optimal solution of the original problem may lie in the sub-tree
search space associated with that sub-problem. This is the role of
the bounding operator, which uses a lower bound function to prune
nodes and the sub-trees they are the roots of. Indeed, if the
lower bound value of the sub-problem is greater than the upper
bound found so far, the sub-problem is not decomposed/branched
because it is certain that the optimal solution is not located in
its sub-tree search space. This significantly reduces the
exploration time of the B&B tree. Therefore, the efficiency of a
B&B algorithm depends strongly on the quality of its lower bound
function. In this paper, we use the lower bound proposed by
Lenstra et al. [3] for FSP, based on Johnson's algorithm [?].

D. Complexity analysis and implementation 

For an efficient implementation of the lower bound (LB) algorithm,
six data structures are required: the matrix PTM of the processing
times of the jobs, the matrix of lags LM, the Johnson's matrix JM,
the matrix RM of the earliest starting times of jobs, the matrix
QM of their lowest latency times and the matrix MM containing the
couples of machines. In the LB expression, the computation of the
term $P_{Ja}(J, M_k, M_l)$ requires the calculation of the lag of
each remaining job to be scheduled on the couple $(M_k, M_l)$ of
machines using its processing times on these machines (Johnson's
rule with lags). Such computation is repeated for each couple
$(M_k, M_l)$ of machines with $1 \leq k, l \leq m$ and $k < l$. To
avoid the repetitive computation of the lags, they are computed
once at the beginning of the algorithm and stored in the matrix
LM. The dimension of LM is $n \times \frac{m(m-1)}{2}$,
where n and m are respectively the number



Matrix   Size                          Number of accesses
PTM      $n \times m$                  $n' \times m \times (m-1)$
LM       $n \times \frac{m(m-1)}{2}$   $n' \times \frac{m(m-1)}{2}$
JM       $n \times \frac{m(m-1)}{2}$   $n \times \frac{m(m-1)}{2}$
RM       $m$                           $m \times (m-1)$
QM       $m$                           $\frac{m(m-1)}{2}$
MM       $m \times (m-1)$              $m \times (m-1)$

TABLE I
The different data structures of the LB algorithm and their
associated complexities in memory size and numbers of accesses.
The parameters n, m and n' designate respectively the total number
of jobs, the total number of machines and the number of remaining
jobs to be scheduled for the sub-problem whose lower bound is
being computed.



    int computeLB(){
      /* LB is a maximum over all machine couples, so it starts from
         the smallest value. */
      LB = minInteger;
      for (index = 0; index < m*(m-1)/2; index++){
        /* Couple of machines (M1, M2) handled at this iteration. */
        M1 = MM[index][0];
        M2 = MM[index][1];
        /* Earliest starting times on M1 and M2. */
        timeOnM1 = min_{0 <= j < n}(RM[M1][j]);
        timeOnM2 = min_{0 <= j < n}(RM[M2][j]);
        /* Schedule the remaining jobs in Johnson's order for this couple. */
        for (i = 0; i < n; i++){
          job = JM[i][index];
          if (job not yet scheduled){
            timeOnM1 = timeOnM1 + PTM[job][M1];
            if (timeOnM2 > timeOnM1 + LM[job][index])   /* (*) */
              timeOnM2 += PTM[job][M2];
            else
              timeOnM2 = timeOnM1 + LM[job][index] + PTM[job][M2];
          }
        }
        /* Add the lowest latency time after M2. */
        timeOnM2 += min_{0 <= j < n}(QM[M2][j]);
        LB = max(timeOnM2, LB);
      }
      return LB;
    }

Fig. 2. Pseudo-code implementing the LB function



of jobs to be scheduled and the number of machines. LM is accessed
$n' \times \frac{m(m-1)}{2}$ times, $n'$ being the number of
remaining jobs to be scheduled in the sub-problem for which the
lower bound is being calculated. The processing times of all the
jobs on all the machines are stored in the matrix PTM. This matrix
has a dimension of $n \times m$ and is accessed
$n' \times m \times (m-1)$ times.

Table I is essential to understanding the proposed data placement
approach. Its columns give respectively the name of the data
structure, its size and the number of times it is accessed.
Figure 2 shows the pseudo-code implementing the LB lower bound
function, illustrating the accesses to the six data structures.
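As an illustration of the MM data structure, the sketch below enumerates every couple of machines with $k < l$ and stores it as a row of two indices, which gives the $m(m-1)/2$ rows (that is, $m(m-1)$ entries) listed in Table I. The function name and the enumeration order are assumptions made for this example; the paper does not specify the order in which couples are stored.

```c
/* Fills mm[index][0..1] with every couple (k, l) of machine indices such
   that 0 <= k < l < m, and returns the number of couples, m(m-1)/2.
   The caller must provide room for that many rows.
   (Hypothetical helper; the enumeration order is an assumption.) */
int build_machine_couples(int m, int (*mm)[2]) {
    int index = 0;
    for (int k = 0; k < m; k++) {
        for (int l = k + 1; l < m; l++) {
            mm[index][0] = k;   /* first machine of the couple  */
            mm[index][1] = l;   /* second machine of the couple */
            index++;
        }
    }
    return index;
}
```

With $m = 20$, as in the Taillard instances considered later, the function produces 190 couples, which is the number of iterations of the outer loop of the LB pseudo-code.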

III. GPU-based B&B for FSP - A New Approach

As said previously, the time complexity of the Johnson algorithm
for two machines is $O(n \log n)$. Therefore, the time complexity
of the lower bound LB for $m$ machines is $O(m^2 n \log n)$. The
computation of the lower bound is consequently time-intensive,
especially for problem instances for which $m$ is high. In order
to evaluate its CPU time cost experimentally, we have implemented
this lower bound and tested it on the most time-consuming
Taillard's instances [6], i.e. those with $m = 20$. The results
show that the time spent by the B&B evaluating the lower bounds of
the examined sub-problems is on average around 98.5% of its total
execution time. Such a result demonstrates that the bounding must
be parallelized, i.e. the LB lower bound function must be applied
in parallel to each sub-problem composing the pool of sub-problems
currently examined. In the following, we present a new GPU-based
approach for the parallel evaluation of the lower bound in B&B
algorithms. We first present the parallel GPU-based approach.
Then, we show how our approach maps the different data structures
onto the memory hierarchy of the GPU device, taking into account
the characteristics of the data structures presented in Table I
and those of the different GPU memories (size and access latency).

A. The GPU-based parallel evaluation of LB 

The GPU-accelerated approach is based on the GPGPU 
(CUDA or OpenCL) parallel paradigm according to which 
the programmer writes a serial program that calls parallel 
kernels (simple functions or full programs). A kernel executes 
in parallel across a set of parallel threads. The programmer 
organizes these threads into a hierarchy of grids of thread 
blocks. A thread block is a set of concurrent threads that 
can cooperate through barrier synchronization and shared 
access to a memory space private to the block. A grid is a 
set of thread blocks that may be executed independently in 
parallel. When invoking a kernel, the programmer specifies 
the execution configuration. Such configuration includes the 
number of threads per block and the number of blocks making 
up the grid. 

In our proposed GPU-based approach, the generation of the
sub-problems to be solved (elimination, selection and branching
operations) is performed on CPU and the evaluation of their lower
bounds (bounding operation) is executed on the GPU device. As
illustrated in Figure 3, the pool of sub-problems generated on CPU
(and selected according to the well-known best-first strategy) is
off-loaded to the GPU device to be evaluated by a pool of threads
partitioned into blocks. Each thread applies the lower bound
function (kernel) to one sub-problem. Once the evaluation is
completed, the lower bound values of the different sub-problems
are returned to the CPU to be used by the elimination operator to
decide whether each sub-problem is pruned or decomposed. The
process is iterated until the exploration is completed and the
optimal solution is found.
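The CPU-side elimination step applied to the lower bounds returned by the GPU can be sketched as follows. This is only an illustration under stated assumptions: sub-problems are represented by integer identifiers, and the names `eliminate`, `ids` and `lb` are hypothetical. A sub-problem is kept when its lower bound is lower than or equal to the upper bound and pruned otherwise, as in the caption of Fig. 1.

```c
#include <stddef.h>

/* Keeps only the sub-problems whose lower bound does not exceed the
   upper bound. ids and lb are parallel arrays of length count; the
   survivors are compacted to the front of both arrays and their number
   is returned. (Hypothetical helper, not the authors' code.) */
size_t eliminate(int *ids, int *lb, size_t count, int upper_bound) {
    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (lb[i] <= upper_bound) {   /* not pruned: will be decomposed */
            ids[kept] = ids[i];
            lb[kept] = lb[i];
            kept++;
        }
    }
    return kept;
}
```

The surviving sub-problems are the ones the branching operator would decompose in the next iteration.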

B. Data access optimization 

During their execution, threads may access data from multiple
memory spaces having different sizes and access latencies. At the
thread level, each thread has its own allocated registers and a
private local memory. CUDA [17] uses this local memory for
thread-private variables that do not fit in the thread's
registers, as well as for stack frames and register spilling. At
the thread block level, each thread block has a shared memory
visible to all its associated threads. At the grid level, all
threads have access to the same global memory. Texture and
constant cached memories are two other memories accessible by all
threads. The data access optimization challenge is to find the
best mapping between the data structures of the application at
hand (different sizes and access frequencies) and the GPU
hierarchy of memories (different sizes and access latencies). For
instance, the global memory is large in size but has a high access
latency. In contrast, shared memory is smaller in size but has a
lower access latency.

Fig. 3. GPU evaluation of sub-problems: generation on CPU and
evaluation on GPU.

For B&B applied to FSP, threads of the same block perform 
concurrent accesses to the six data structures of the problem 
when they execute the LB lower bound function. To optimize 
the performance of such application, the best mapping of the 
data structures is to copy them on the shared memory of the 
GPU device. However, for large problem instances, not all of the
data structures fit into the shared memory, whose size is limited
and depends on the GPU hardware configuration. The challenge is
therefore to decide which data structures must be put in the
shared memory to get the best performance. The answer is given in
the next section according to the complexity analysis presented in
Table I and the underlying GPU configuration of our experiments.

IV. Experiments 

To evaluate the performance of our GPU-based B&B algorithm and
parallel bounding approach, we have considered the largest
Taillard's FSP benchmarks proposed in [6], except those with 500
jobs because they do not fit in the memory of the CPU. The
different instances are designated by n × m, where n and m
represent respectively the number of jobs (between 20 and 200) to
be scheduled and the number of machines (20) on which they are
scheduled. The GPU-based B&B has been implemented using C-CUDA
4.0, and compiled using nvcc. The experiments have been carried
out on an Intel Xeon E5520 paired with a GPU device. E5520 is
64-bit and composed of two quad-core chips, and has a clock speed
of 2.27GHz. The GPU device is an Nvidia Tesla C2050 with 448 CUDA
cores (14 multiprocessors with 32 cores each), a clock speed of
1.15GHz, a 2.8GB global memory, a configurable shared memory
(16 KB or 48 KB) and a warp size of 32.

In the following, an experimental study is presented with the
objective of evaluating the performance impact of the GPU-based
parallel evaluation of the lower bound, and of the data access
optimization. For each, we present the objectives of the
experiments and report the obtained results. Two parameters are
considered: the problem instances (n × m) (as rows in the tables
and x-axis in the graphics) and the size of the pool of
sub-problems to be evaluated (as columns in the tables and x-axis
in the graphics). The first parameter gives information on the
granularity of the thread computations. As the complexity of the
computation of the lower bound is $O(m^2 n \log n)$, for large
problem instances (i.e. large values of n and m) the grain size of
the kernel executed by each thread is much higher. Moreover, the
first parameter gives information on the size of the data
structures to be mapped onto the GPU memories. This is highly
important for the study of the data access optimization approach.
The second parameter is designated in the different experimental
results by pool size (number of blocks × number of threads per
block). This parameter is useful to get information on the time
cost of the data transfer between CPU and GPU and on the total
number of threads to be triggered on GPU.

For each pair of values associated with the two parameters, each
table/graphic reports the corresponding parallel efficiency. Since
the instances used are very hard to solve (optimal solutions for
many of these instances are still not known), we used the approach
defined in [11] to run the experiments. This method makes it
possible to obtain a random list L of sub-problems such that the
resolution of L lasts $T_{cpu}$ minutes with a sequential B&B. To
ensure that the sub-problems explored by the GPU and CPU B&B
versions are exactly the same, we initialize the pool of our
GPU-based B&B with the same list L of sub-problems used in the
sequential version. If the resolution with the GPU-based B&B lasts
$T_{gpu}$ minutes, the parallel efficiency is the ratio
$T_{cpu} / T_{gpu}$: the execution time of the serial B&B on a
single CPU core (without GPU) over the execution time of our
GPU-based B&B on a CPU core coupled with a GPU device.

A. Performance impact of GPU-based parallelism 

First, the objective of the experimental study presented in
this section is to demonstrate that our GPU-based B&B allows
one to significantly accelerate the resolution process whatever
the FSP instance. However, the best achieved acceleration
depends strongly on the problem instance being solved and 
the size of the pool of sub-problems considered at execution. 
The second objective is therefore to exhibit the behavior of the 



GPU acceleration according to the tackled problem instance 
and the considered pool size. More exactly, the goal is to 
find for each problem instance the best pool size required to 
maximize the benefit taken from the use of the GPU device. 

The results reported in Table II are obtained without any data
access optimization. The six matrices are generated on CPU and
then copied to the GPU global memory. The size of the thread
blocks is experimentally fixed to 256 threads. Average
accelerations from ×44.52 to ×60.64 are achieved, with a peak of
×77.46. In addition, the improvement of the parallel efficiency
from a pool size of 4096 (16 × 256) to 8192 (32 × 256) is
significant. The reason is that the number of blocks (16) for the
first pool size is not sufficient to get a better acceleration.
Indeed, it is known that the number of blocks should be at least
twice the number of multiprocessors of the target GPU device
(14 × 2 = 28 for the C2050 GPU card). Furthermore, for the 50 × 20
and 20 × 20 problem instances the best parallel efficiency is
achieved for a pool size of 8192. For larger instances, i.e.
100 × 20 and 200 × 20, it is obtained with a pool size of 262144.
These two pool size values correspond exactly to the two sizes of
the pool for which the best ratio is obtained between the lower
bound evaluation time of the pool on CPU and its total
communication time from CPU to GPU and from GPU to CPU.
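The relation between pool size, block size and number of blocks used above can be made concrete with a small sketch. The helper names are hypothetical, and the second function only encodes the rule of thumb quoted in the text (at least two blocks per multiprocessor); it is not a CUDA API call.

```c
/* Number of thread blocks needed so that each sub-problem of the pool
   gets its own thread (rounded up). Hypothetical helper. */
unsigned blocks_for_pool(unsigned pool_size, unsigned threads_per_block) {
    return (pool_size + threads_per_block - 1) / threads_per_block;
}

/* Rule of thumb quoted in the text: the grid should contain at least
   twice as many blocks as the device has multiprocessors. */
int enough_blocks(unsigned blocks, unsigned multiprocessors) {
    return blocks >= 2 * multiprocessors;
}
```

With 256 threads per block and the 14 multiprocessors of the C2050, a pool of 4096 sub-problems gives 16 blocks, which is below the threshold of 28, whereas a pool of 8192 gives 32 blocks, which is above it.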

B. Data access optimization 

The objective here is to find the best mapping of the six data
structures of the lower bound LB kernel onto the memories of the
GPU device. As stated in Section III-B, such a mapping depends on
the sizes and access latencies/frequencies of these data
structures and the GPU memories. The focus is put on the shared
memory, which is a key enabler for many high-performance CUDA
applications. We also take care to use the global memory
adequately by judiciously configuring the L1 cache, which greatly
improves performance over direct access to global memory. Indeed,
the GPU device we are using in our experiments is a Tesla C2050
(see Section IV), which is a device based on the NVIDIA Fermi
architecture. In the Fermi architecture, each multiprocessor of
the GPU device is provided with a 64 KB local storage that can be
configured as shared memory and L1 cache. For this reason, and in
order to achieve further performance, we divided the 64 KB memory
according to the scenario being tested. For the scenario where the
data structures are put in the shared memory, the 64 KB of
available storage is split into 48 KB for shared memory and 16 KB
for L1 cache. For the scenario where the data sets are put in
global memory, we used 16 KB for shared memory and 48 KB for L1
cache.

As far as the data structures of the lower bound function are
concerned, their complexities in terms of size and access
frequency are reported in Table I (see Section II-D). According to
Table I, RM, QM and MM have a small size, so storing them in the
shared memory yields only a marginal performance improvement.
Therefore, whatever the memory to which they are off-loaded, the
performance impact is not significant. However, for large FSP
instances (with n = 200), the total


Problem instance   4096     8192     16384    32768     65536     131072    262144
                   16×256   32×256   64×256   128×256   256×256   512×256   1024×256
200×20             46.63    60.88    63.80    67.51     73.47     75.94     77.46
100×20             45.35    58.49    60.15    62.75     66.49     66.64     67.01
50×20              44.39    58.30    57.72    57.68     57.37     57.01     56.42
20×20              41.71    50.28    49.19    45.90     42.03     41.80     41.65
Average speedup    44.52    56.99    57.72    58.46     59.84     60.35     60.64

TABLE II
Parallel efficiency for different problem instances and pool
sizes. All the matrices JM, PTM, LM, RM, QM and MM are located in
the GPU global memory.



amount of memory required to store the other data structures, i.e.
JM and LM (38KB each) and PTM (4KB), is 80KB, which is greater
than the available shared memory space (48KB). Therefore, only two
of them can be put in the shared memory. LM requires twice as much
memory as JM and is accessed much less frequently, so it is better
to map JM to the shared memory. Furthermore, PTM has the same
access frequency as JM but requires less memory space.
Consequently, the focus is put on the study of the performance
impact of the placement of JM and PTM in the shared memory: PTM
and JM are stored in shared memory and all the other data
structures are placed in global memory.
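The memory budget discussed above can be checked with a short sketch. It assumes one byte per matrix element, which is not stated in the text but reproduces the quoted 38KB and 4KB figures for n = 200 and m = 20; the function names are hypothetical and the sizes follow Table I.

```c
#include <stddef.h>

/* Sizes in bytes, ASSUMING one byte per matrix element (an assumption
   made for this illustration; it matches the figures quoted in the text). */
size_t couples(size_t m)             { return m * (m - 1) / 2; }
size_t ptm_bytes(size_t n, size_t m) { return n * m; }
size_t jm_bytes(size_t n, size_t m)  { return n * couples(m); }
size_t lm_bytes(size_t n, size_t m)  { return n * couples(m); }

/* Returns nonzero when the given footprint fits in the shared memory. */
int fits(size_t bytes, size_t shared_bytes) {
    return bytes <= shared_bytes;
}
```

Under this assumption, for n = 200 and m = 20 the three matrices together need 80000 bytes, which exceeds a 48 KB shared memory, while JM and PTM together need 42000 bytes and fit.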

Table III reports the behavior of the parallel efficiency averaged
over the different problem instances (sizes) as a function of the
pool size. The table shows that the parallel efficiency grows on
average with the pool size, in the same way as in Table II. For
instance, for the largest problem instance and pool size, the
parallel efficiency grows from ×77.46 (PTM and JM in global
memory) to ×100.48 (PTM and JM in shared memory) (23%).

Figure 4 depicts the behavior of the parallel efficiency for the
different problem instances (sizes). The pool size is fixed to
262144 (1024 × 256). According to the graphics, first, the
efficiency is improved for all instances and the improvement is
more significant for large problem instances. Second, the behavior
of the efficiency improvement is not the same whether shared
memory is used or not. Indeed, according to the CUDA GPU occupancy
calculator, the size of the shared memory occupied by the data
structures limits the number of active thread warps to 32 for the
20 × 20 and 50 × 20 problem instances, and to 16 for the 100 × 20
and 200 × 20 problem instances. When only global memory is used,
the improvement is linear and the slope remains the same, as the
number of active thread warps remains the same (32) whatever the
problem instance. The only limiting factor of the active thread
warps is the number of registers, which is 26 in our case. In this
case, the size of the occupied shared memory is lower and is not a
limiting factor for the occupancy or number of active threads. On
the other hand, when shared memory is used, the slope of the
efficiency improvement is much higher from 20 × 20 to 50 × 20
(small data structures) than from 100 × 20 to 200 × 20 (large data
structures). The reason is that, according to the CUDA GPU
occupancy calculator, in addition to the number of registers, the
size of the occupied shared memory is also a limiting factor of
thread occupancy and thus parallel efficiency.

Fig. 4. Average parallel efficiency for different problem
instances: PTM and JM are put together in the shared memory, the
pool size is fixed to 1024 × 256.

V. Performance Comparison with a Multi-threaded Parallel B&B Algorithm

With the advent of multi-core processors and their promised
enhancement of software performance, the use of multi-core
processors for designing parallel algorithms has become
widespread. Unlike distributed computing systems, one of the
advantages of multi-core systems is the possibility to parallelize
the algorithm using threads instead of processes. Indeed, while
processes on the same machine have their own virtual memory,
threads of a process share the same virtual memory, which
significantly impacts performance.

Several implementations of a multi-threaded B&B have been
proposed in previous research works [10], [9], [13], [16].
These multi-threaded B&B algorithms can be classified into
two categories: low-level and high-level. In a low-level
multi-threaded B&B, a low-level thread model such as POSIX
Threads is used [12], [9], while in a high-level multi-threaded
B&B a high-level thread model such as OpenMP [2] is used.

In order to further evaluate the performance of the proposed
GPU-based B&B algorithm, we compare it to a low-level
multi-threaded B&B [9] designed on top of a multi-core
system, using the POSIX Threads library.
TABLE III

Parallel efficiency for different FSP instances and pool sizes obtained with data access optimization. PTM and JM are placed together in shared memory and all others are placed in global memory.



In order to perform a fair comparison with the results
obtained by our GPU-based approach, the multi-core system used
must have the same computational power in terms of theoretical
peak floating-point operations per second. Floating-point
operations per second (FLOPS) is a common measure
of a computer's performance, especially in scientific
computing. Indeed, FLOPS is a good indicator of
performance in digital signal processing, scientific
simulations, etc. It is particularly used in supercomputer
rankings such as the TOP500 [22].

As stated in Section IV, the experiments have been carried out
on an Nvidia Tesla C2050. According to its manufacturer,
NVIDIA [18], the theoretical double precision floating-point
performance peak of this GPU device is about 515 GFLOPS.
For the multi-threaded version of the B&B, we have carried
out experiments on an Intel Core i7-970 processor, which
is 64-bit and composed of six physical cores and 12 threads
[21], each having a theoretical double precision floating-point
performance peak of 76.8 GFLOPS [20].
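The matching of computational power can be checked with a short calculation. The sketch below takes the two peak figures quoted above at face value (515 GFLOPS for the GPU and 76.8 GFLOPS per CPU computing unit) and computes how many CPU units are needed to reach the GPU peak; the function and variable names are illustrative only.

```python
import math

GPU_PEAK_GFLOPS = 515.0          # Tesla C2050, double precision (from the text)
CPU_PEAK_GFLOPS_PER_CORE = 76.8  # per-unit figure as stated in the text


def cores_to_match(gpu_peak, core_peak):
    """Smallest number of CPU units whose combined peak reaches gpu_peak."""
    return math.ceil(gpu_peak / core_peak)


cores = cores_to_match(GPU_PEAK_GFLOPS, CPU_PEAK_GFLOPS_PER_CORE)
combined_peak = cores * CPU_PEAK_GFLOPS_PER_CORE
print(cores, round(combined_peak, 1))
```

Under these figures the result is seven units with a combined peak of about 537.6 GFLOPS, which is consistent with the 7 CPU computing cores and the "around 500 GFLOPS" used in the comparison later in this section.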

Table IV reports the speedup of the parallel multi-threaded
B&B averaged over the different problem instances (sizes). The
columns correspond to the number of B&B threads running in
parallel and the corresponding theoretical peak in GFLOPS.
The rows correspond to the problem instances defined by
(Number of jobs x Number of machines). The same experimental
protocol as for the GPU computation is used (see
Section IV). The reported speedups are calculated relative
to a serial B&B on a single CPU core. The results show that the
parallel efficiency grows on average with the number of
computing cores used. However, the improvement
is not linear and the slope decreases as the number of
computing cores used rises. This behavior might be due
to the operating system, which handles additional page faults
and context switches when the number of threads increases.

Figure 5 compares the speedups obtained with our
GPU-based B&B and with the multi-threaded B&B. The speedups
are calculated relative to the same sequential version of the
B&B algorithm. For the same computational power, our approach
for designing B&B algorithms on top of GPU accelerators is
much more efficient than the multi-threaded B&B whatever the
instance is. Indeed, for a computational power of around
500 GFLOPS, the acceleration obtained with the GPU-based B&B
for the instances with 20 jobs and 20 machines is x61.47. For
the same category of instances (20 jobs and 20 machines) and
the same computational power of 500 GFLOPS, which corresponds
to 7 CPU computing cores of the Intel Core i7-970 processor,
the speedup of the multi-threaded B&B over its sequential
version is x9.22. The results also show that the parallel
efficiency of the GPU-based approach increases with the size
of the problem being tackled, while it remains almost the same
for the multi-threaded algorithm. This is due to the complexity
of the computation of the lower bound, which is
$O(m^2 \cdot n \cdot \log n)$. When the size of the problem
instance (i.e. the values of n and m) increases, the grain size
of the kernel executed by each thread becomes larger, which
significantly increases the GPU throughput. For instance, for
the problems of the category with 200 jobs and 20 machines, the
reported speedup of our approach is about x100.48 while the
speedup calculated for the multi-threaded version is x8.76,
which corresponds to an improvement of x11.47. Over all the
instance categories tested, the GPU-based B&B runs faster than
the multi-threaded B&B.

Fig. 5. Comparison between the average parallel efficiency for different
problem instances obtained with a GPU and a multithreaded-based B&B for
the same computational power (500 GFLOPS).
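Two quick checks may help in reading these figures. The sketch below first recomputes the improvement of the GPU-based B&B over the multi-threaded B&B as the ratio of the two reported speedups, and then estimates how the work per lower bound grows with the number of jobs, using the stated complexity up to an unknown constant factor. The function names are illustrative, and the work estimate is only a proportionality, not a measured cost.

```python
import math


def improvement(gpu_speedup, cpu_speedup):
    """Ratio of two speedups measured against the same sequential baseline."""
    return gpu_speedup / cpu_speedup


def bound_work(n, m):
    """Work per lower bound up to a constant factor: m^2 * n * log(n)."""
    return m * m * n * math.log(n)


# (jobs, machines) -> (GPU speedup, multi-threaded speedup), from the text
reported = {
    (20, 20): (61.47, 9.22),
    (200, 20): (100.48, 8.76),
}

for (jobs, machines), (gpu, cpu) in reported.items():
    print(f"{jobs} x {machines}: x{improvement(gpu, cpu):.2f}")

# Relative growth of the per-thread grain size from 20 to 200 jobs (m = 20).
growth = bound_work(200, 20) / bound_work(20, 20)
print(f"work growth: x{growth:.1f}")
```

The ratio for the 200 x 20 category reproduces the x11.47 quoted above. The work estimate grows faster than the number of jobs itself, which fits the observation that larger instances give each GPU thread a coarser grain of computation.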

TABLE IV

Parallel efficiency for different problem instances using the multi-threaded B&B.

VI. Conclusion and Future Work

In this paper, we have revisited the parallel B&B algorithm
for solving permutation-based combinatorial optimization
problems such as FSP on GPU accelerators. The contributions
consist in proposing: (1) a GPU-based parallel design
and implementation of the parallel bounding model; (2) a data
access optimization approach to take into account the memory
constraints of the GPU device. The Flow-Shop scheduling
problem has been considered as a case study together with
Johnson's lower bound [__], extended in [__] to more than two
machines. The proposed approaches have been tested
using a Tesla C2050 GPU card on 4 different classes of FSP
instances.

In our proposed GPU-based approach, the decomposition
and pruning of the sub-problems are performed on the CPU and
the evaluation of their lower bounds (bounding operation) is
executed on the GPU. Pools of sub-problems are off-loaded from
the CPU to the GPU to be evaluated by blocks of threads. After
evaluation, the lower bounds are returned to the CPU.
The experimental results show that accelerations up to x77
can be obtained, especially for large problem instances and
large pools of sub-problems. As shown in the reported results,
the pool size that achieves the best acceleration of
the bounding mechanism depends strongly on the size of the
problem instance being solved. Therefore, this parameter has
to be determined at runtime by testing different pool sizes.
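The division of work described above can be sketched as a pool-based loop. This is a minimal CPU-only illustration under several assumptions: `evaluate_pool` is a hypothetical stand-in for the GPU kernel, the lower bound is a simple last-machine bound and not the Johnson-based bound used in the paper, and all function names and the default `pool_size` are illustrative.

```python
def completion_times(prefix, ptm):
    """Completion time on each machine after scheduling the jobs in prefix."""
    m = len(ptm[0])
    front = [0] * m
    for job in prefix:
        front[0] += ptm[job][0]
        for k in range(1, m):
            front[k] = max(front[k], front[k - 1]) + ptm[job][k]
    return front


def lower_bound(prefix, ptm):
    """Simple stand-in bound (NOT the Johnson-based bound of the paper):
    last-machine completion time plus remaining work on the last machine."""
    scheduled = set(prefix)
    rest = sum(ptm[j][-1] for j in range(len(ptm)) if j not in scheduled)
    return completion_times(prefix, ptm)[-1] + rest


def evaluate_pool(pool, ptm):
    """Stand-in for the GPU kernel: one lower bound per sub-problem."""
    return [lower_bound(prefix, ptm) for prefix in pool]


def branch_and_bound(ptm, pool_size=4):
    """Return the optimal makespan of a permutation flow-shop instance."""
    n = len(ptm)
    best = float("inf")
    pending = [()]  # sub-problems are partial permutations (prefixes)
    while pending:
        # Select a pool of sub-problems and "off-load" its bounding.
        pool, pending = pending[-pool_size:], pending[:-pool_size]
        bounds = evaluate_pool(pool, ptm)
        for prefix, bound in zip(pool, bounds):
            if bound >= best:
                continue  # pruning, done on the CPU side
            if len(prefix) == n:
                best = bound  # complete schedule: the bound is its makespan
                continue
            # Branching, done on the CPU side.
            pending.extend(prefix + (j,) for j in range(n) if j not in prefix)
    return best
```

For example, `branch_and_bound([[3, 2], [1, 4], [2, 1]])` explores a 3-job, 2-machine instance given as a processing time matrix with one row per job. Changing `pool_size` changes only how many sub-problems are bounded per batch, not the result, which is why the pool size can be tuned separately for each instance size.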

The proposed data access optimization is based on a
preliminary analysis of the lower bound function. This analysis
allowed us to identify six data structures, for which we have
proposed a complexity analysis in terms of memory size
and access frequency. Due to the limited size of the shared
memory, the matrices do not all fit in it together. According to
the complexity study, the recommendation is to put in the shared
memory the Johnson's and the processing time matrices (JM
and PTM) if they fit together. The other data structures
are mapped to the global memory combined with the L1 cache
(see Section IV-B). This recommendation has been confirmed through
extensive experiments using Taillard's benchmarks of the
Flow-Shop problem and a recent Tesla C2050 GPU card. The
optimizations obtained with the proposed approaches allowed
us to achieve accelerations up to x100 compared to a single
CPU-based B&B and up to x11 compared to a multi-threaded
CPU-based execution.

We are currently investigating the combination of the
GPU-based bounding model with the multi-core parallel search tree
exploration for the design and implementation of a
GPU-accelerated multi-core B&B algorithm. In the near future,
we plan to extend this work to a cluster of GPU-accelerated
multi-core processors. From an application point of view, the
objective is to solve to optimality challenging and
unsolved Flow-Shop instances, as we did for one 50 x 20
problem instance using grid computing [11]. Finally, we plan
to investigate other lower bound functions to deal with other
combinatorial optimization problems.